Three datasets were collected from Gapminder to serve as samples for this analysis: population_total.csv, gnicap_atm_con.csv, and life_expectancy_years.csv.
The first question we'll be exploring is: Is life expectancy affected by the population size?
The second question we'll be exploring is: Is there a correlation between income (GNI per capita) and life expectancy?
For ease throughout the analysis, I will label gnicap_atm_con as df_inc.
#Importing library packages to be used throughout project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline
# Load data
df_pop = pd.read_csv('population_total.csv')
df_lyf = pd.read_csv('life_expectancy_years.csv')
df_inc = pd.read_csv('gnicap_atm_con.csv')
#confirming right data loaded(population)
print(df_pop.shape)
df_pop.head()
(197, 302)
| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 3.28M | 3.28M | 3.28M | 3.28M | 3.28M | 3.28M | 3.28M | 3.28M | 3.28M | ... | 76.6M | 76.4M | 76.3M | 76.1M | 76M | 75.8M | 75.6M | 75.4M | 75.2M | 74.9M |
| 1 | Angola | 1.57M | 1.57M | 1.57M | 1.57M | 1.57M | 1.57M | 1.57M | 1.57M | 1.57M | ... | 168M | 170M | 172M | 175M | 177M | 179M | 182M | 184M | 186M | 188M |
| 2 | Albania | 400k | 402k | 404k | 405k | 407k | 409k | 411k | 413k | 414k | ... | 1.33M | 1.3M | 1.27M | 1.25M | 1.22M | 1.19M | 1.17M | 1.14M | 1.11M | 1.09M |
| 3 | Andorra | 2650 | 2650 | 2650 | 2650 | 2650 | 2650 | 2650 | 2650 | 2650 | ... | 63k | 62.9k | 62.9k | 62.8k | 62.7k | 62.7k | 62.6k | 62.5k | 62.5k | 62.4k |
| 4 | United Arab Emirates | 40.2k | 40.2k | 40.2k | 40.2k | 40.2k | 40.2k | 40.2k | 40.2k | 40.2k | ... | 12.3M | 12.4M | 12.5M | 12.5M | 12.6M | 12.7M | 12.7M | 12.8M | 12.8M | 12.9M |
5 rows × 302 columns
#confirming right data loaded(life expectancy)
print(df_lyf.shape)
df_lyf.head()
(195, 302)
| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.1 | 28.1 | 28.1 | ... | 75.5 | 75.7 | 75.8 | 76.0 | 76.1 | 76.2 | 76.4 | 76.5 | 76.6 | 76.8 |
| 1 | Angola | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | ... | 78.8 | 79.0 | 79.1 | 79.2 | 79.3 | 79.5 | 79.6 | 79.7 | 79.9 | 80.0 |
| 2 | Albania | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | ... | 87.4 | 87.5 | 87.6 | 87.7 | 87.8 | 87.9 | 88.0 | 88.2 | 88.3 | 88.4 |
| 3 | Andorra | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | United Arab Emirates | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | ... | 82.4 | 82.5 | 82.6 | 82.7 | 82.8 | 82.9 | 83.0 | 83.1 | 83.2 | 83.3 |
5 rows × 302 columns
I noticed a lot of null values in the life expectancy dataset.
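A quick way to put a number on that impression is to count the rows that contain at least one missing value. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in, not the Gapminder file), so the counts it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a wide Gapminder-style table.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "1800": [28.2, np.nan, 35.4],
    "1801": [28.2, np.nan, 35.4],
})

# Number of rows with at least one missing value.
rows_with_nulls = int(toy.isnull().any(axis=1).sum())

# Number of missing cells in each column.
nulls_per_column = toy.isnull().sum()

print(rows_with_nulls)
print(nulls_per_column)
```

The same two expressions could be applied to df_lyf to see how many countries would be lost by dropping rows.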
#confirming right data loaded(GNIperCap)
print(df_inc.shape)
df_inc.head()
(191, 252)
| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2041 | 2042 | 2043 | 2044 | 2045 | 2046 | 2047 | 2048 | 2049 | 2050 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | ... | 751 | 767 | 783 | 800 | 817 | 834 | 852 | 870 | 888 | 907 |
| 1 | Angola | 517.0 | 519.0 | 522.0 | 524.0 | 525.0 | 528.0 | 531.0 | 533.0 | 536.0 | ... | 2770 | 2830 | 2890 | 2950 | 3010 | 3080 | 3140 | 3210 | 3280 | 3340 |
| 2 | Albania | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | ... | 9610 | 9820 | 10k | 10.2k | 10.5k | 10.7k | 10.9k | 11.1k | 11.4k | 11.6k |
| 3 | United Arab Emirates | 738.0 | 740.0 | 743.0 | 746.0 | 749.0 | 751.0 | 754.0 | 757.0 | 760.0 | ... | 47.9k | 48.9k | 50k | 51k | 52.1k | 53.2k | 54.3k | 55.5k | 56.7k | 57.9k |
| 4 | Argentina | 794.0 | 797.0 | 799.0 | 802.0 | 805.0 | 808.0 | 810.0 | 813.0 | 816.0 | ... | 12.8k | 13.1k | 13.4k | 13.6k | 13.9k | 14.2k | 14.5k | 14.8k | 15.2k | 15.5k |
5 rows × 252 columns
df_pop.duplicated().sum(), df_inc.duplicated().sum(), df_lyf.duplicated().sum()
(0, 0, 0)
No duplicated data was found.
df_pop.dtypes.unique()
array([dtype('O')], dtype=object)
df_lyf.dtypes.unique()
array([dtype('O'), dtype('float64')], dtype=object)
df_inc.dtypes.unique()
array([dtype('O'), dtype('float64')], dtype=object)
We see that the 3 datasets share the same layout: each is referenced by country, with one column per year. However, there are a lot of null values in some of the datasets, and these will have to be cleaned. Dropping the rows with missing values is the best choice for me in order to minimize errors.
Also, it would be quite complex to work with separate datasets at once, so I prefer to transform each of them into a table with 3 columns (country, year and the measured value) and then merge them together, given that they share the country and year references.
Now, I'll take care of the null values using dropna; from there I will transform each dataset into a 3-column table and finally merge all three tables together!
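The reshape-then-merge plan can be sketched on two tiny made-up tables before applying it to the real files. The frames `pop_wide` and `life_wide` and their values are hypothetical; only the `melt` and `merge` calls mirror what the notebook does.

```python
import pandas as pd

# Hypothetical wide tables: one row per country, one column per year.
pop_wide = pd.DataFrame({"country": ["A", "B"], "1800": [100, 200], "1801": [110, 210]})
life_wide = pd.DataFrame({"country": ["A", "B"], "1800": [30.0, 35.0], "1801": [30.5, 35.5]})

# melt() turns the year columns into (year, value) rows.
pop_long = pop_wide.melt(id_vars=["country"], var_name="year", value_name="pop")
life_long = life_wide.melt(id_vars=["country"], var_name="year", value_name="life_exp")

# The default inner merge keeps only (country, year) pairs present in both tables.
merged = pop_long.merge(life_long, on=["country", "year"])
print(merged)
```

Because the merge is an inner join, any country or year missing from one table drops out of the result, which is why the merged frame in this notebook stops at 2050, the last year in the income file.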
df_pop.isnull().sum()
country 0
1800 0
1801 0
1802 0
1803 0
..
2096 0
2097 0
2098 0
2099 0
2100 0
Length: 302, dtype: int64
#Dropping all null value rows found in the life expectancy dataset
df_lyf = df_lyf.dropna()
df_lyf
| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.1 | 28.1 | 28.1 | ... | 75.5 | 75.7 | 75.8 | 76.0 | 76.1 | 76.2 | 76.4 | 76.5 | 76.6 | 76.8 |
| 1 | Angola | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | ... | 78.8 | 79.0 | 79.1 | 79.2 | 79.3 | 79.5 | 79.6 | 79.7 | 79.9 | 80.0 |
| 2 | Albania | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | ... | 87.4 | 87.5 | 87.6 | 87.7 | 87.8 | 87.9 | 88.0 | 88.2 | 88.3 | 88.4 |
| 4 | United Arab Emirates | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | ... | 82.4 | 82.5 | 82.6 | 82.7 | 82.8 | 82.9 | 83.0 | 83.1 | 83.2 | 83.3 |
| 5 | Argentina | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | ... | 86.2 | 86.3 | 86.5 | 86.5 | 86.7 | 86.8 | 86.9 | 87.0 | 87.1 | 87.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 190 | Samoa | 25.4 | 25.4 | 25.4 | 25.4 | 25.4 | 25.4 | 25.4 | 25.4 | 25.4 | ... | 79.8 | 79.9 | 80.0 | 80.1 | 80.3 | 80.4 | 80.5 | 80.6 | 80.7 | 80.8 |
| 191 | Yemen | 23.4 | 23.4 | 23.4 | 23.4 | 23.4 | 23.4 | 23.4 | 23.4 | 23.4 | ... | 76.9 | 77.0 | 77.1 | 77.3 | 77.4 | 77.5 | 77.6 | 77.8 | 77.9 | 78.0 |
| 192 | South Africa | 33.5 | 33.5 | 33.5 | 33.5 | 33.5 | 33.5 | 33.5 | 33.5 | 33.5 | ... | 76.4 | 76.5 | 76.7 | 76.8 | 77.0 | 77.1 | 77.3 | 77.4 | 77.5 | 77.7 |
| 193 | Zambia | 32.6 | 32.6 | 32.6 | 32.6 | 32.6 | 32.6 | 32.6 | 32.6 | 32.6 | ... | 75.8 | 76.0 | 76.1 | 76.3 | 76.4 | 76.5 | 76.7 | 76.8 | 77.0 | 77.1 |
| 194 | Zimbabwe | 33.7 | 33.7 | 33.7 | 33.7 | 33.7 | 33.7 | 33.7 | 33.7 | 33.7 | ... | 73.3 | 73.4 | 73.5 | 73.7 | 73.8 | 73.9 | 74.0 | 74.2 | 74.3 | 74.4 |
186 rows × 302 columns
df_lyf.isnull().sum()
country 0
1800 0
1801 0
1802 0
1803 0
..
2096 0
2097 0
2098 0
2099 0
2100 0
Length: 302, dtype: int64
#Making sure there are no null values in the df_inc rows
df_inc = df_inc.dropna()
df_inc
| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2041 | 2042 | 2043 | 2044 | 2045 | 2046 | 2047 | 2048 | 2049 | 2050 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | ... | 751 | 767 | 783 | 800 | 817 | 834 | 852 | 870 | 888 | 907 |
| 1 | Angola | 517.0 | 519.0 | 522.0 | 524.0 | 525.0 | 528.0 | 531.0 | 533.0 | 536.0 | ... | 2770 | 2830 | 2890 | 2950 | 3010 | 3080 | 3140 | 3210 | 3280 | 3340 |
| 2 | Albania | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 | ... | 9610 | 9820 | 10k | 10.2k | 10.5k | 10.7k | 10.9k | 11.1k | 11.4k | 11.6k |
| 3 | United Arab Emirates | 738.0 | 740.0 | 743.0 | 746.0 | 749.0 | 751.0 | 754.0 | 757.0 | 760.0 | ... | 47.9k | 48.9k | 50k | 51k | 52.1k | 53.2k | 54.3k | 55.5k | 56.7k | 57.9k |
| 4 | Argentina | 794.0 | 797.0 | 799.0 | 802.0 | 805.0 | 808.0 | 810.0 | 813.0 | 816.0 | ... | 12.8k | 13.1k | 13.4k | 13.6k | 13.9k | 14.2k | 14.5k | 14.8k | 15.2k | 15.5k |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 186 | Samoa | 373.0 | 373.0 | 373.0 | 373.0 | 373.0 | 373.0 | 373.0 | 374.0 | 374.0 | ... | 5330 | 5440 | 5560 | 5670 | 5790 | 5920 | 6040 | 6170 | 6300 | 6440 |
| 187 | Yemen | 197.0 | 198.0 | 198.0 | 199.0 | 199.0 | 200.0 | 200.0 | 201.0 | 202.0 | ... | 1440 | 1470 | 1500 | 1530 | 1560 | 1590 | 1630 | 1660 | 1700 | 1730 |
| 188 | South Africa | 800.0 | 791.0 | 782.0 | 773.0 | 765.0 | 724.0 | 724.0 | 786.0 | 687.0 | ... | 7630 | 7790 | 7960 | 8130 | 8300 | 8480 | 8660 | 8840 | 9030 | 9220 |
| 189 | Zambia | 213.0 | 214.0 | 215.0 | 215.0 | 215.0 | 216.0 | 216.0 | 217.0 | 217.0 | ... | 1260 | 1290 | 1320 | 1340 | 1370 | 1400 | 1430 | 1460 | 1490 | 1520 |
| 190 | Zimbabwe | 443.0 | 444.0 | 444.0 | 445.0 | 445.0 | 446.0 | 446.0 | 446.0 | 447.0 | ... | 1560 | 1590 | 1620 | 1660 | 1690 | 1730 | 1770 | 1800 | 1840 | 1880 |
190 rows × 252 columns
df_inc.isnull().sum()
country 0
1800 0
1801 0
1802 0
1803 0
..
2046 0
2047 0
2048 0
2049 0
2050 0
Length: 252, dtype: int64
df_pop = df_pop.melt(id_vars=["country"],
var_name="year",
value_name="pop")
df_pop
| | country | year | pop |
|---|---|---|---|
| 0 | Afghanistan | 1800 | 3.28M |
| 1 | Angola | 1800 | 1.57M |
| 2 | Albania | 1800 | 400k |
| 3 | Andorra | 1800 | 2650 |
| 4 | United Arab Emirates | 1800 | 40.2k |
| ... | ... | ... | ... |
| 59292 | Samoa | 2100 | 310k |
| 59293 | Yemen | 2100 | 53.2M |
| 59294 | South Africa | 2100 | 79.2M |
| 59295 | Zambia | 2100 | 81.5M |
| 59296 | Zimbabwe | 2100 | 31M |
59297 rows × 3 columns
df_inc = df_inc.melt(id_vars=["country"],
var_name="year",
value_name="income")
df_inc
| | country | year | income |
|---|---|---|---|
| 0 | Afghanistan | 1800 | 207.0 |
| 1 | Angola | 1800 | 517.0 |
| 2 | Albania | 1800 | 207.0 |
| 3 | United Arab Emirates | 1800 | 738.0 |
| 4 | Argentina | 1800 | 794.0 |
| ... | ... | ... | ... |
| 47685 | Samoa | 2050 | 6440 |
| 47686 | Yemen | 2050 | 1730 |
| 47687 | South Africa | 2050 | 9220 |
| 47688 | Zambia | 2050 | 1520 |
| 47689 | Zimbabwe | 2050 | 1880 |
47690 rows × 3 columns
df_lyf = df_lyf.melt(id_vars=["country"],
var_name="year",
value_name="life_exp")
df_lyf
| | country | year | life_exp |
|---|---|---|---|
| 0 | Afghanistan | 1800 | 28.2 |
| 1 | Angola | 1800 | 27.0 |
| 2 | Albania | 1800 | 35.4 |
| 3 | United Arab Emirates | 1800 | 30.7 |
| 4 | Argentina | 1800 | 33.2 |
| ... | ... | ... | ... |
| 55981 | Samoa | 2100 | 80.8 |
| 55982 | Yemen | 2100 | 78.0 |
| 55983 | South Africa | 2100 | 77.7 |
| 55984 | Zambia | 2100 | 77.1 |
| 55985 | Zimbabwe | 2100 | 74.4 |
55986 rows × 3 columns
df1 = df_pop.merge(df_inc,on=['country','year']).merge(df_lyf,on=['country','year'])
print(df1)
| | country | year | pop | income | life_exp |
|---|---|---|---|---|---|
| 0 | Afghanistan | 1800 | 3.28M | 207.0 | 28.2 |
| 1 | Angola | 1800 | 1.57M | 517.0 | 27.0 |
| 2 | Albania | 1800 | 400k | 207.0 | 35.4 |
| 3 | United Arab Emirates | 1800 | 40.2k | 738.0 | 30.7 |
| 4 | Argentina | 1800 | 534k | 794.0 | 33.2 |
| ... | ... | ... | ... | ... | ... |
| 46179 | Samoa | 2050 | 267k | 6440 | 74.3 |
| 46180 | Yemen | 2050 | 48.1M | 1730 | 72.2 |
| 46181 | South Africa | 2050 | 75.5M | 9220 | 70.9 |
| 46182 | Zambia | 2050 | 39.1M | 1520 | 69.8 |
| 46183 | Zimbabwe | 2050 | 23.9M | 1880 | 67.6 |
46184 rows × 5 columns
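#Convert year and life expectancy to numeric types
df1['year']=df1['year'].astype(int)
df1['life_exp']=df1['life_exp'].astype(float)
#Expand the unit suffixes (k, M, B) into multiplications, evaluate them and store the result as integers
df1['income'] = df1['income'].replace({'k': '*1e3', 'M': '*1e6'}, regex=True).map(pd.eval).astype(int)
df1['pop'] = df1['pop'].replace({'k': '*1e3', 'M': '*1e6', 'B': '*1e9'}, regex=True).map(pd.eval).astype(int)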
df1.dtypes
country object
year int32
pop int32
income int32
life_exp float64
dtype: object
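The suffix conversion can also be written as a small explicit parser, which avoids evaluating strings as expressions. This is only a sketch of an alternative, not the method used above: `to_number` and `SUFFIXES` are hypothetical names, and the sample values are copied from the tables shown earlier.

```python
import pandas as pd

# Multipliers for the unit suffixes that appear in the Gapminder tables.
SUFFIXES = {"k": 1e3, "M": 1e6, "B": 1e9}

def to_number(value):
    """Convert values such as '3.28M', '400k', '2650' or 207.0 to a float."""
    text = str(value).strip()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)

raw = pd.Series(["3.28M", "400k", "2650", 207.0])

# Round before casting so floating-point noise cannot truncate a value downward.
parsed = raw.map(to_number).round().astype("int64")
print(parsed.tolist())
```

Casting to "int64" explicitly keeps the result the same on every platform, whereas a plain `astype(int)` can give int32 on some systems, as in the dtypes listed above.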
df1.duplicated().sum()
0
df1.isnull().sum()
country 0
year 0
pop 0
income 0
life_exp 0
dtype: int64
#trim data, working with data from year:1980-2020
df1=df1.loc[33153:40513]
df1.head()
| | country | year | pop | income | life_exp |
|---|---|---|---|---|---|
| 33153 | Cameroon | 1980 | 8620000 | 2260 | 55.4 |
| 33154 | Congo, Dem. Rep. | 1980 | 26400000 | 533 | 52.1 |
| 33155 | Congo, Rep. | 1980 | 1780000 | 1270 | 52.8 |
| 33156 | Colombia | 1980 | 26900000 | 2170 | 68.9 |
| 33157 | Comoros | 1980 | 308000 | 1340 | 54.5 |
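A slice by row label depends on the row order of the merged frame, so the rows it returns may not line up exactly with the intended years for every country. A minimal sketch of an alternative, assuming year is already an integer column as in df1, filters on the year values themselves. The frame `toy` below is hypothetical.

```python
import pandas as pd

# Hypothetical long-format frame with the same column names as df1.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [1979, 1980, 2021, 1980, 2020, 2021],
    "life_exp": [50.0, 50.5, 60.0, 70.0, 80.0, 80.1],
})

# Keep every row whose year lies in 1980-2020 (both ends inclusive).
trimmed = toy[toy["year"].between(1980, 2020)]
print(trimmed)
```

With this approach every country keeps all of its rows for the chosen period, regardless of where they sit in the frame.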
Now that the data is cleaned, trimmed and set, we can move on to the analysis. Let's get it!!
The idea is to find out whether the life expectancy of less developed countries differs significantly from that of developed countries, using Cameroon and France as case studies. We will also verify whether their populations have any impact on life expectancy.
H0: Life expectancy in developed countries = life expectancy in less developed countries
H1: Life expectancy in developed countries != life expectancy in less developed countries
#query data with country = Cameroon
df_cm = df1.query('country == "Cameroon"')
df_cm.head()
| | country | year | pop | income | life_exp |
|---|---|---|---|---|---|
| 33153 | Cameroon | 1980 | 8620000 | 2260 | 55.4 |
| 33337 | Cameroon | 1981 | 8890000 | 2360 | 55.8 |
| 33521 | Cameroon | 1982 | 9170000 | 2460 | 56.3 |
| 33705 | Cameroon | 1983 | 9460000 | 2590 | 56.8 |
| 33889 | Cameroon | 1984 | 9760000 | 2720 | 57.1 |
#Write a function to plot layouts; this is to avoid duplicates and confusion
def plotter(x):
    #Apply the shared layout (grouped bars, tilted x ticks, centred title) and return the figure
    test = x.update_layout(barmode='group', xaxis_tickangle=-45, title={
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
    return test
#df_fr['life_exp'].hist(figsize=(15,5));
fig = px.histogram(df_cm, x='life_exp', height=320, labels={'life_exp':'Life Expectancy'}, title='Life Expectancy Histogram - Cameroon');
plotter(fig)
We deduce from the above histogram that life expectancy in Cameroon from 1980-2020 ranges from 54-63 years, with the highest frequency between 57-59 years. The distribution is skewed to the right (its tail extends toward the higher values), signifying that in most years life expectancy is between 54-59 years.
#query data with country = France
df_fr = df1.query('country == "France"')
df_fr.head()
| | country | year | pop | income | life_exp |
|---|---|---|---|---|---|
| 33176 | France | 1980 | 53900000 | 29000 | 74.7 |
| 33360 | France | 1981 | 54100000 | 29200 | 74.9 |
| 33544 | France | 1982 | 54400000 | 29800 | 75.1 |
| 33728 | France | 1983 | 54700000 | 30100 | 75.3 |
| 33912 | France | 1984 | 55000000 | 30400 | 75.6 |
#df_fr['life_exp'].hist(figsize=(15,5));
fig = px.histogram(df_fr, x='life_exp', height=320, labels={'life_exp':'Life Expectancy'}, title='Life Expectancy Histogram - France')
plotter(fig)
We deduce from the above histogram that life expectancy in France from 1980-2019 is most frequently above 82 years. The distribution looks roughly symmetric (mean 79.2 years, median 79.05 years) rather than strictly normal, with values ranging between 75-83 years, significantly higher than in Cameroon (54-63 years).
#plotting relationship between life expectancy and years
#fig1=px.bar(df_cm, x='year', y='life_exp', height=320, labels={'life_exp':'Life Expectancy'})
#fig1.update_layout(barmode='group', xaxis_tickangle=-45)
fig = px.bar(df_cm, x='year', y='life_exp', height=320, labels={'life_exp':'Life Expectancy'}, title='Relationship between Life Expectancy and Years - Cameroon')
plotter(fig)
fig = px.scatter(df_cm, x='year', y='life_exp', height=320, labels={'life_exp':'Life Expectancy'}, title='Relationship between Life Expectancy and Years - Cameroon')
plotter(fig)
From the bar chart and scatter plot above, we observe a steady rise from 2002 signifying that life expectancy in Cameroon has grown from 54 to 63+ over the last 20 years
fig = px.bar(df_fr, x='year', y='life_exp', height=320, labels={'life_exp':'Life Expectancy'}, title='Relationship between Life Expectancy and Years - France')
plotter(fig)
fig = px.scatter(df_fr, x='year', y='life_exp', height=320, labels={'life_exp':'Life Expectancy'}, title='Relationship between Life Expectancy and Years - France')
plotter(fig)
From the bar chart and the scatter plot above, we can see a steady rise throughout the years, signifying that life expectancy in France has grown from about 75 to 82+ years between 1980 and 2019.
fig = px.bar(df_cm, x='year', y='pop', color='life_exp', height=320, labels={'pop':'Population Cameroon'}, title='Life Expectancy with Respect to Pop Growth per Year - Cameroon')
plotter(fig)
fig=px.scatter(df_cm, x='year', y='pop', color='life_exp', height=320, labels={'pop':'Population Cameroon'}, title='Life Expectancy with Respect to Pop Growth per Year - Cameroon')
plotter(fig)
fig=px.bar(df_fr, x='year', y='pop', color='life_exp', height=320, labels={'pop':'Population France'}, title='Life Expectancy with Respect to Pop Growth per Year - France')
plotter(fig)
fig=px.scatter(df_fr, x='year', y='pop', color='life_exp', height=320, labels={'pop':'Population France'}, title='Life Expectancy with Respect to Pop Growth per Year - France')
plotter(fig)
The relationship graphs above show a steady rise in life expectancy in both countries and a clear gap between them. France has the larger population (64.7M people against Cameroon's 26.5M), and by 2019 its life expectancy was already at 82+ years, far higher than Cameroon's (62+ years in 2020).
#compute to get descriptive statistics
df_cm.describe()
| | year | pop | income | life_exp |
|---|---|---|---|---|
| count | 41.000000 | 4.100000e+01 | 41.000000 | 41.000000 |
| mean | 2000.000000 | 1.625122e+07 | 1672.926829 | 57.495122 |
| std | 11.979149 | 5.311372e+06 | 476.358079 | 2.438642 |
| min | 1980.000000 | 8.620000e+06 | 976.000000 | 54.200000 |
| 25% | 1990.000000 | 1.180000e+07 | 1390.000000 | 55.700000 |
| 50% | 2000.000000 | 1.550000e+07 | 1590.000000 | 57.200000 |
| 75% | 2010.000000 | 2.030000e+07 | 1750.000000 | 58.500000 |
| max | 2020.000000 | 2.650000e+07 | 2720.000000 | 63.500000 |
#compute to get descriptive statistics
df_fr.describe()
| | year | pop | income | life_exp |
|---|---|---|---|---|
| count | 40.000000 | 4.000000e+01 | 40.000000 | 40.000000 |
| mean | 1999.500000 | 5.945750e+07 | 40265.000000 | 79.152500 |
| std | 11.690452 | 3.550362e+06 | 7163.209065 | 2.599999 |
| min | 1980.000000 | 5.390000e+07 | 29000.000000 | 74.700000 |
| 25% | 1989.750000 | 5.662500e+07 | 34650.000000 | 77.150000 |
| 50% | 1999.500000 | 5.885000e+07 | 40700.000000 | 79.050000 |
| 75% | 2009.250000 | 6.260000e+07 | 45450.000000 | 81.450000 |
| max | 2019.000000 | 6.510000e+07 | 52800.000000 | 82.900000 |
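The hypotheses H0 and H1 stated earlier can be checked roughly from the two describe() outputs alone. The sketch below computes Welch's t statistic from the life_exp mean, standard deviation and count of each country. Choosing Welch's test is my assumption, not something the analysis above specifies, and `welch_t` is a hypothetical helper. Yearly values for a single country are not independent samples, so the result is better read as a descriptive gap measure than as a formal test.

```python
import math

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and approximate degrees of freedom from summary statistics."""
    var_a = std_a ** 2 / n_a
    var_b = std_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, dof

# life_exp mean, std and count copied from the describe() outputs above
# (France first, then Cameroon).
t_stat, dof = welch_t(79.152500, 2.599999, 40, 57.495122, 2.438642, 41)
print(round(t_stat, 1), round(dof, 1))
```

A t statistic well above 30, as this one is, says the gap between the two means is very large relative to the year-to-year spread within each country.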
#df_cm['income'].hist(figsize=(15,5));
fig = px.histogram(df_cm, x='income', height=320, labels={'income':'Income'}, title='Income Histogram - Cameroon');
plotter(fig)
The income histogram is skewed to the right (the mean of 1673 USD exceeds the median of 1590 USD), with the highest frequency between 1500-1700 USD and very few years above 1750 USD.
#Relationship between life expectancy and income with respect to the population growth in Cameroon on bar chart
fig=px.bar(df_cm, x='pop', y='income', height=320, labels={'pop':'Population'}, color='life_exp', title='Life Expectancy And Income Per Year - Cameroon')
plotter(fig)
#Scatter plot for relationship between life expectancy and income with respect to the population growth in Cameroon on Scatter chart
fig=px.scatter(df_cm, x='pop', y='income', height=320, labels={'pop':'Population'}, color='life_exp', title='Life Expectancy And Income Per Year - Cameroon')
plotter(fig)
In the majority of years, GNI per capita is below 2000 USD. In all the years when GNI per capita is above 2000 USD, life expectancy does not reach 59 years.
#df_fr['income'].hist(figsize=(15,5));
fig = px.histogram(df_fr, x='income', height=320, labels={'income':'Income'}, title='Income Histogram - France');
plotter(fig)
#Relationship between life expectancy and income with respect to the population growth in France on bar chart
fig= px.bar(df_fr, x='pop', y='income', height=320, labels={'pop':'Population'}, color='life_exp', title='Life Expectancy And Income Per Year - France')
plotter(fig)
#Scatter plot for relationship between life expectancy and income with respect to the population growth in France on scatter chart
fig= px.scatter(df_fr, x='pop', y='income', height=320, labels={'pop':'Population'}, color='life_exp', title='Life Expectancy And Income Per Year - France')
plotter(fig)
Comparing the plots for Cameroon and France, we see a clear difference. In Cameroon, life expectancy is lower in the years with higher GNI as the population grows. In contrast, in France life expectancy increases along with GNI and population.
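One way to put a number on such a relationship is the Pearson correlation coefficient. The sketch below uses only the five France rows shown earlier (1980-1984), so it illustrates the calculation rather than giving a result for the whole period.

```python
import numpy as np

# First five France rows from the table shown earlier (1980-1984).
income = np.array([29000, 29200, 29800, 30100, 30400], dtype=float)
life_exp = np.array([74.7, 74.9, 75.1, 75.3, 75.6])

# Pearson correlation: covariance divided by the product of the standard deviations.
r = np.corrcoef(income, life_exp)[0, 1]
print(round(r, 2))
```

On the full frames the same figure could be obtained with `df_fr['income'].corr(df_fr['life_exp'])` and likewise for df_cm. Since both series also trend with time, a high coefficient on its own does not show that income drives life expectancy.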
From the analysis carried out:
We found using the case studies that France has a higher life expectancy (82+ years) than Cameroon (63+ years).
We see that life expectancy had a steady rise for 20+ years in both Cameroon and France, which could mean that the increase in population has no negative effect on life expectancy.
The mean GNI per capita of Cameroon is 1672.9 USD while that of France is 40265 USD; this also points to a higher standard of living in the developed countries.
We see in the analysis that, in Cameroon, life expectancy is lower in the years with GNI > 2000 USD and higher in the years with GNI < 2000 USD, whereas in France life expectancy increases with GNI.
In conclusion, population size does have an effect on life expectancy; as we can see on the bar and scatter plots, life expectancy gets higher as the population size increases. Why?
Finally, income does have an effect on life expectancy, but this effect depends on the population in question. It can be negative (as in the case of Cameroon, where a higher GNI goes with a lower life expectancy) or positive (as in the case of France, where a higher GNI goes with a higher life expectancy).
There are a few limitations with our data:
The statistics focus mostly on description, with only a little hypothesis testing, so we did not attempt any inferential or causal analysis.
We worked with a limited amount of data, because a good number of null values did not permit us to widen the scope of the analysis.